Abstract

Mutation testing is widely regarded as an effective method for revealing faults. Despite this
advantage, it has proved impractical for industrial use because of its cost. Automated
techniques are therefore needed both to apply the method and to reduce its demands. Whilst there
is much evidence that automated test data generation techniques can effectively automate the
testing process, there has been little work on applying them in the context of mutation testing.
In this paper, search-based testing is used to generate test inputs capable of revealing
mutants. To this end, a dynamic execution scheme capable of introducing mutants and guiding the
search towards them is proposed. Experimentation with the proposed approach indicates that it
outperforms previously proposed methods. Additionally, the framework's feasibility and
practicality in producing mutation-based test cases are demonstrated.

Keywords: Test case generation, Search based testing, Mutation testing 



Introduction 

Software testing can account for more than half of the 
cost of the software under development. As the main 
purpose is to reduce such an excessive cost, the testing 
activity should incorporate effective and efficient methods 
experiencing the highest possible level of automation. The 
test data generation process plays a crucial role in both 
the effectiveness and efficiency of the software testing 
phase. Unfortunately, as it is evident from the current 
practice, the level of automation achieved to date is not as 
high as it ought to be, thus resulting in a rather low qual- 
ity testing activity due to the unavoidably high cost of the 
imperative laborious manual activity. Hence, the need for 
producing the required test data automatically is essential 
in order to increase the test thoroughness and to reduce 
the testing expenses at the same time. 

Testing quality is usually measured by test adequacy criteria. Adequacy criteria, often referred
to as coverage criteria, pose certain requirements that should be fulfilled by the test cases.
Mutation testing, or mutation analysis, is a fault-based technique introduced by Hamlet (1977)
and DeMillo et al. (1978). Mutation analysis makes alterations, called mutants, to the code
under test based on a set of simple syntactic rules called mutant operators. Mutants are
injected into programs both to guide the generation of test cases that reveal them and to assess
the quality of the test data. Testing therefore seeks to reveal the mutants, which are termed
"killed" when detected and "live" otherwise. Testing adequacy is measured using the mutation
score, defined as the ratio of the number of killed mutants to the total number of candidate
mutants reduced by the number of equivalent ones. Equivalent mutants are those that cannot be
killed by any test case. This is analogous to the infeasible element problem encountered in
structural testing (Offutt and Pan 1997).
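
As an illustration of this definition, the mutation score can be written as follows, where K, M
and E are symbols introduced here only for illustration and denote the numbers of killed,
candidate and equivalent mutants respectively:

$$MS = \frac{K}{M - E}$$

For instance, under the purely hypothetical figures M = 100, E = 10 and K = 72, the score would
be 72 / 90 = 0.8.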

The strength of the method relies on the hypothesis that the introduced mutants are able to
represent realistic faults. A study by Andrews et al. (2006) reinforces this hypothesis.
Additionally, mutation has been empirically found to be more effective than other structural
testing criteria (Offutt and Untch 2001) and it provides significant assistance in various
debugging activities (Papadakis and Le Traon 2012). Thus, developers can evidently benefit from
applying mutation testing. Although powerful, mutation is rarely used in practice. The practical
use of an adequacy criterion requires the automated generation of test cases according to its
requirements. This can prove to be a very laborious task (Offutt and Untch 2001), while little
has been done to effectively automate the production of mutation-based test cases. This
constitutes the main issue of the present research.

Automating mutation testing requires the production 
of the candidate mutant programs and their execution 
with the candidate test cases. This can be efficiently 
automated with the mutant schemata approach (Untch 
et al. 1993), (Papadakis et al. 2010), (Ma et al. 2005) 
which embeds all the candidate mutants into one sche- 
matic meta-program and thus, all tests are executed 
against this schematic program. Test execution poses an 
additional barrier to mutation analysis as it requires test 
cases to be executed against all live mutants. To effect- 
ively reduce the execution time required for mutation, 
alternative methods called weak or firm mutation (Jia 
and Harman 2010), (Howden 1982), (Offutt and Untch 
2001) have been proposed. According to these methods 
the program execution may stop after the mutated or a 
succeeding program expression. Evaluation of the mutant 
can be performed by checking the program state at the 
stopping execution statement. Thus, remarkable execution 
savings can be achieved. Additionally, by utilizing weak 
mutation and mutant schemata techniques all the weakly 
killable mutants can be recorded with only one execution 
run (Papadakis et al. 2010), (Papadakis and Malevris 
2011). The framework proposed in this paper takes advan- 
tage of this fact and executes only the weakly killed mu- 
tants in order to determine the strongly killed ones. Thus, 
the mutant execution cost is kept to a low level. 

The practical use of an adequacy criterion requires the automated generation of test cases
according to its requirements. This can be a very laborious task (Offutt and Untch 2001) for any
selected criterion, including mutation. Recently, search based optimization techniques and tools
have succeeded in automating the test case generation activity effectively, e.g. (Harman and
McMinn 2007) and (Lakhotia et al. 2010). This paper introduces an automated framework that
produces test cases based on strong mutation testing. In the proposed framework, the mutants are
automatically generated based on a novel version of the mutant schemata technique (Untch et al.
1993) for performing both mutation and search based testing. The use of mutant schemata for
mutation test data generation purposes has also been investigated by (Papadakis and Malevris
2011), (Papadakis et al. 2010) in the context of weak mutation, utilizing existing structural
testing tools, and in the context of strong mutation using dynamic symbolic execution (Papadakis
and Malevris 2010a). Here the proposed approach incorporates a hill climbing algorithm known as
the alternating variable method (AVM), proposed by Korel (1990), for searching and producing the
sought test cases for strong mutation. The AVM was chosen for its simplicity and its high
expected effectiveness, as observed in the context of structural testing (Harman and McMinn
2007), (Lakhotia et al. 2010).

The originality of the present approach lies in the dynamic fitness scheme it utilizes, which
makes it possible to effectively direct the search process towards reaching, infecting and
impacting the targeted mutants. A case study suggests that the approach can be more effective
than random testing and a previously proposed approach (Ayari et al. 2007).

The contribution of the present work can be summa- 
rized into the following points: 

• A novel scheme able to perform both mutation and 
search based testing. 

• A novel fitness function for strongly killing mutants. 

• A novel dynamically adjusted fitness scheme able to 
improve the effectiveness of search based 
approaches. 

The rest of this paper is organized as follows: Section 2 presents the proposed system by
detailing the proposed approach. Section 3 presents and analyzes the results obtained from the
application of the proposed approach. Sections 4 and 5 discuss the relevance and the benefits of
the proposed system with respect to previously proposed systems and approaches. Finally, Section
6 gives conclusions and future directions.

Framework description 

The proposed framework tries to effectively automate 
the production and evaluation of mutation based test 
data. The framework embeds all the candidate mutants 
into one schematic meta-program suitable for both exe- 
cuting mutants and recording the required test cases fit- 
ness calculations. Then it produces test cases according 
to the alternating variable method (Korel 1990) guided 
by the schematic program. In the succeeding subsections 
details of the framework are given. 

Generating mutants 

Dynamic approaches are based on information gained through runtime program execution. In the
context of structural testing the programs under test host all the needed information in their
structure and thus, it is straightforward to implement a monitoring mechanism for evolving the
test data. In the context of mutation, there is a special need for unifying both the original's
and the mutants' runtime information. The difficulty originates from the nature of the mutation
method, as the needed information is spread across the original program and its various mutant
versions (Papadakis et al. 2010), (Papadakis and Malevris 2009). To efficiently overcome this
difficulty a special form of the mutant schemata technique was employed in order to unify all
the mutation analysis requirements into a single program version suitable for guiding the test
evolution. This approach was initially introduced in (Papadakis et al. 2010) in the context of
using existing test data generation tools for performing mutation. Here the technique has been
extended in order to effectively guide the test data generation process. This is achieved by
embedding the fitness guidance and evaluation inside the schematic functions.

The Mutant Schemata Generation (MSG) technique (Untch et al. 1993) encodes all the introduced
mutants into one meta-program. This is achieved by appropriately replacing each pair of operands
participating in an operation with a call to a schematic function, with this pair of operands as
parameters (e.g. a > b becomes RelationalGT(a, b)). Extending the MSG approach, the evaluation
of the mutants' execution and all the required fitness calculations are performed within the
schematic function. As shown in (Papadakis et al. 2010), this reduces the problem of killing
mutants to a path and branch coverage problem. By incorporating the mutant evaluation into the
schematic function, the necessary conditions for killing each considered mutant are also
embedded. These conditions are formed as decisions in the schematic function and take the
following form:

$$\text{Original expression} \neq \text{Mutated expression} \tag{1}$$

The above decision (1) has two possible outcomes (i.e. the original expression is either equal
to the mutated one or not), which introduce a true case (the mutant is killed) and a false case
(the mutant is alive). Measuring how close a test comes to making decision (1) true yields an
effective measure of test case fitness according to the weak mutation testing criterion
(Papadakis et al. 2010) and forms a part of the proposed fitness function (section 2.4). After
the mutants' evaluation point, the program execution continues in order to evaluate the mutant's
output and its propagation fitness, as required by strong mutation.
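
The idea of a schematic function that both executes a mutant and records its killing condition
can be sketched as follows. This is only a minimal illustration under stated assumptions, not
the framework's actual code: the class, field and method names are hypothetical, only two of the
seven relational mutants are shown, and the recorded distances follow the "a > b" row of Table 2.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative mutant schema for the ">" operator; all names are hypothetical. */
final class SchemataSketch {
    /** Smallest mutation distance seen so far per mutant id (0 means weakly killed). */
    static final Map<Integer, Double> mutationDistance = new HashMap<>();
    /** Id of the mutant executed in place of the original, or -1 for the original program. */
    static int activeMutant = -1;
    /** Positive constant used when a condition simply does not hold. */
    static final double K = 1.0;

    /** Stands in for "a > b"; firstId is the id of the first mutant at this location. */
    static boolean relationalGT(int a, int b, int firstId) {
        boolean original = a > b;
        // Two of the mutants of "a > b": ">=" (id firstId) and "<" (id firstId + 1).
        boolean[] mutated = { a >= b, a < b };
        // Distance from "original != mutated": abs(a-b) for ">=", K for "<".
        double[] distance = { Math.abs(a - b), K };
        boolean result = original;
        for (int m = 0; m < mutated.length; m++) {
            int id = firstId + m;
            // Decision (1): a distance of 0 means the mutant differs from the original here.
            double d = (original != mutated[m]) ? 0.0 : distance[m];
            mutationDistance.merge(id, d, Double::min);
            if (id == activeMutant) {
                result = mutated[m]; // run the selected mutant instead of the original
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(relationalGT(5, 3, 1) + " " + mutationDistance);
    }
}
```

Keeping the recording and the mutant selection in one function means that a single run of the
meta-program can report which mutants were weakly killed, while setting the hypothetical
activeMutant field re-runs the same program as one specific mutant.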

Executing mutants with tests 

The present approach takes advantage of the unified representation of all mutants and their
killing conditions in one meta-program. Based on the introduced schemata technique, mutant
execution can be performed straightforwardly using only one program. This is a direct
consequence of the parameterized introduction of mutants (Untch et al. 1993), (Papadakis and
Malevris 2011). Additionally, the fitness function calculations have also been embedded into the
schematic classes. Thus, test execution requires only an initialization of the mutant schemata
class at the beginning of the execution and an additional call to the calculation function at
the end in order to produce all the required fitness calculations.

The proposed system employs the reflection mechanism of the Java language in order to execute
the meta-program with the produced test cases and extract the fitness function calculations. As
the proposed system performs a different search per live mutant, it executes all the live
mutants only when a search succeeds (i.e. kills the targeted mutant). This approach might be
less effective than executing all mutants against all produced tests (fitness evaluations) but
keeps mutation execution overheads at a low level. Additionally, for further savings it
determines the strongly killed mutants by executing only those that have previously been weakly
killed, as pointed out in the introduction.

Search based testing 

Search based testing (Harman and McMinn 2007), (Wegener et al. 2001) formulates the test case
generation problem as a search problem and tries to tackle it using search based optimization
techniques. The search process is guided by an appropriate fitness function which indicates how
close the tests are to covering the targeted program elements. In the present context, a
separate search is performed for each live mutant. To keep mutation execution overheads at a low
level the framework executes all the live mutants only when a search succeeds (i.e. kills the
targeted mutant), as suggested in Fraser and Zeller (2010).

The proposed framework uses the alternating variable method, proposed by Korel (1990). This
method is a hill climbing algorithm which has been shown to be quite effective compared with
other search algorithms in the context of structural testing (Harman and McMinn 2007). Hence, it
forms an ideal choice as it is quite simple to implement and is expected to be quite powerful.
Here, it should be noted that mutation, in particular weak mutation, can be transformed to
branch testing (Papadakis and Malevris 2011), (Papadakis et al. 2010) and since hill climbing
performs similarly to its rivals in the structural testing context (Harman and McMinn 2007),
there is no reason why this should not hold for mutants too. Nevertheless, this is beyond the
scope of the present paper and is left open for future research.

The method starts by randomly initializing the values of the program input variables. Then it
repeatedly selects one of those values and adjusts it. This is performed until no further
fitness improvement can be obtained, i.e. no further alterations are fruitful. In this case the
method switches to the next input variable. The algorithm stops when no further fitness
improvement can be recorded by selecting and altering any of the input variables. Details of the
method can be found in Korel (1990).

Fitness function 

Search based testing requires the employment of an ap- 
propriate fitness function in order to be effective. The 
present framework utilizes a fitness function composed 
of four parts. The first two are known as the approach 
level and the branch distance introduced by Wegener 
et al. (2001) in the context of structural testing. The 
third one is named mutation distance and the fourth 
one is named impact distance. 

The approach level measures how close a candidate test case comes to executing a target mutant
statement. It is calculated by counting the number of the target mutant's control dependent
nodes missed by the candidate test input. The branch distance quantifies the distance from
flipping a branch, i.e. changing its outcome from true to false or the opposite. It is computed
using the runtime values of the branch expression of interest, which is the topmost of the
missed control dependencies of the mutant. This measure is calculated based on the expression
formulas of Table 1, which was taken from the Awedikian et al. (2009) study on MC/DC testing.
Mutation distance, as introduced in this paper, applies the branch distance measure to mutants.
This approach is in line with the suggestions made by Bottaci (2001) for mutation testing
fitness calculations. It should be noted that these three measures guide the search towards
fulfilling the reachability and necessity constraints proposed by DeMillo and Offutt (1991).
Table 2 presents the expression formulas on which the mutation distance fitness calculations
were based. These formulas were obtained by simplifying and reducing the necessity constraints
and provide useful information for killing the considered mutants (based on expression 1).

Table 1 Branch fitness

Expression   True branch                   False branch
a == b       abs(a - b)                    a == b ? k : 0
a != b       a != b ? 0 : k                abs(a != b ? a - b : 0)
a < b        abs(a < b ? 0 : a - b + k)    abs(a < b ? a - b + k : 0)
a <= b       abs(a <= b ? 0 : a - b)       abs(a <= b ? a - b : 0)
a > b        abs(a > b ? 0 : a - b + k)    abs(a > b ? a - b + k : 0)
a >= b       abs(a >= b ? 0 : a - b)       abs(a >= b ? a - b : 0)
a || b       min[fit(a), fit(b)]           fit(a) + fit(b)
a && b       fit(a) + fit(b)               min[fit(a), fit(b)]

In Table 2, Tfit(x) and Ffit(x) signify the True and False branch fitness of clause x
respectively. Fulfilling the necessity constraints has been found to be relatively ineffective
at killing mutants that involve changes to predicate expressions (DeMillo and Offutt 1991).
Thus, mutation fitness calculations should also quantify the distance from making the original
and mutant program predicates (at the mutated statement) differ. To achieve this, the distance
from fulfilling the following expression needs to be quantified:

$$(\text{Original pred} = T \land \text{Mutated pred} = F) \lor (\text{Original pred} = F \land \text{Mutated pred} = T)$$

Following the branch fitness calculations of Table 2, the fitness of the above expression,
named predicate mutation distance (pmd), is defined according to expression 2.

$$pmd = \min[\,Tfit(O) + Ffit(M),\; Tfit(M) + Ffit(O)\,] \tag{2}$$

Here O and M denote the original and the mutant predicates respectively.
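
Expression (2) can be illustrated with a short sketch for one original and mutant pair, the
predicate i < k mutated to i <= k. The code below is an assumption-laden illustration rather
than the framework's implementation: the class and method names are hypothetical, and the
True/False branch distances are written in the common "distance plus constant k" form, which is
one reading of Table 1 and not a literal transcription of it.

```java
/** Sketch of true/false branch distances and the predicate mutation distance (pmd). */
final class PredicateDistanceSketch {
    /** Positive constant added when a relation fails only by its strictness. */
    static final double K = 1.0;

    /** Tfit for "a < b": 0 when the relation holds, otherwise the distance from holding. */
    static double tfitLess(double a, double b) { return a < b ? 0.0 : (a - b) + K; }

    /** Ffit for "a < b": 0 when the relation is false, otherwise the distance from a >= b. */
    static double ffitLess(double a, double b) { return a < b ? (b - a) : 0.0; }

    /** Tfit for "a <= b". */
    static double tfitLessEq(double a, double b) { return a <= b ? 0.0 : (a - b); }

    /** Ffit for "a <= b". */
    static double ffitLessEq(double a, double b) { return a <= b ? (b - a) + K : 0.0; }

    /** Expression (2): pmd = min[Tfit(O) + Ffit(M), Tfit(M) + Ffit(O)]. */
    static double pmd(double tfitO, double ffitO, double tfitM, double ffitM) {
        return Math.min(tfitO + ffitM, tfitM + ffitO);
    }

    /** pmd for the original predicate "i < k" against the mutant predicate "i <= k". */
    static double pmdLessVsLessEq(double i, double k) {
        return pmd(tfitLess(i, k), ffitLess(i, k), tfitLessEq(i, k), ffitLessEq(i, k));
    }

    public static void main(String[] args) {
        System.out.println(pmdLessVsLessEq(406, 300));
        System.out.println(pmdLessVsLessEq(406, 406));
    }
}
```

Under these assumptions the pmd of this pair reduces to abs(i - k), which agrees with the
corresponding entry of Table 2: the two predicates differ only when i equals k.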

Impact distance tries to approximate the mutant sufficiency condition (DeMillo and Offutt 1991),
which is difficult to formalize. Following the suggestions made in Fraser and Zeller (2010), one
way of approximating this condition is to measure the impact on the mutant program execution. As
this approach does not guide the search towards specific program parts able to expose the
introduced mutants, it was found to be ineffective with the AVM method. Thus, the proposed
fitness function tries to guide the search towards specific program elements that are likely to
expose the mutants. To achieve this, the impact of each mutant (the differences in the execution
paths between the original and mutant program versions) (Fraser and Zeller 2010) is recorded
during the test generation process. For each program node that has been impacted, the ratio of
killed mutants over the total number of mutants is recorded. Informally, as tests are produced
and executed with mutants, the nodes are ranked according to their ability to expose mutants
when they are impacted. Impact distance applies the approach level and the branch distance to
the mutant program, aiming at covering a selected top ranked node.
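
The node ranking described above can be sketched as a small bookkeeping structure. This is a
hypothetical illustration only: the names are invented, nodes are identified by integers, and
the ratio is assumed to be taken over the mutant executions that impacted the node, which is one
possible reading of the description.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Ranks program nodes by how often impacting them coincided with a killed mutant. */
final class ImpactRankingSketch {
    /** node id -> {executions in which the mutant was killed, all recorded executions} */
    private final Map<Integer, int[]> stats = new HashMap<>();

    /** Records one mutant execution: the nodes it impacted and whether the mutant was killed. */
    void record(Set<Integer> impactedNodes, boolean killed) {
        for (Integer node : impactedNodes) {
            int[] s = stats.computeIfAbsent(node, n -> new int[2]);
            if (killed) {
                s[0]++;
            }
            s[1]++;
        }
    }

    /** Killed/total ratio of a node, or 0 if the node was never impacted. */
    double ratio(int node) {
        int[] s = stats.get(node);
        return s == null ? 0.0 : (double) s[0] / s[1];
    }

    /** Node with the highest ratio (smallest id on ties), or -1 if nothing was recorded. */
    int topNode() {
        int best = -1;
        double bestRatio = -1.0;
        for (Integer node : new TreeSet<>(stats.keySet())) {
            double r = ratio(node);
            if (r > bestRatio) {
                bestRatio = r;
                best = node;
            }
        }
        return best;
    }
}
```

The node returned by topNode() would then play the role of the impact node towards which the
approach level and branch distance of the impact distance are computed.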

Table 2 Mutation fitness

Operator    Original    Mutant: fitness
Relational  a > b       a >= b: abs(a-b)        a != b: abs(a-b + k)    true: abs(a-b)
                        a < b: k                a == b: abs(a-b)        false: abs(a-b + k)
                        a <= b: 0
            a >= b      a > b: abs(a-b)         a != b: abs(a-b)        true: abs(a-b + k)
                        a < b: 0                a == b: abs(a-b + k)    false: abs(a-b)
                        a <= b: k
            a < b       a > b: k                a != b: abs(a-b + k)    true: abs(a-b)
                        a >= b: 0               a == b: abs(a-b)        false: abs(a-b + k)
                        a <= b: abs(a-b)
            a <= b      a > b: 0                a != b: abs(a-b)        true: abs(a-b + k)
                        a >= b: k               a == b: abs(a-b + k)    false: abs(a-b)
                        a < b: abs(a-b)
            a != b      a > b: abs(a-b + k)     a <= b: abs(a-b)        true: abs(a-b)
                        a >= b: abs(a-b)        a == b: 0               false: k
                        a < b: abs(a-b + k)
            a == b      a > b: abs(a-b)         a <= b: abs(a-b + k)    true: k
                        a >= b: abs(a-b + k)    a != b: 0               false: abs(a-b)
                        a < b: abs(a-b)
Arithmetic  a + b       a - b: k                a / b: k                a: k
                        a * b: k                a % b: k                b: k
            a - b       a + b: k                a / b: k                a: k
                        a * b: k                a % b: k                b: k
            a * b       a + b: k                a / b: k                a: k
                        a - b: k                a % b: k                b: k
            a / b       a + b: k                a * b: k                a: k
                        a - b: k                a % b: k                b: k
            a % b       a + b: k                a * b: k                a: k
                        a - b: k                a / b: k                b: k
Absolute    a           abs(a): abs(a + k)      -abs(a): abs(a)         0: abs(a)
Logical     a && b      a || b: min[Tfit(a) + Ffit(b), Ffit(a) + Tfit(b)]
                        a: Tfit(a) + Ffit(b)    true: min[Ffit(a), Ffit(b)]
                        b: Ffit(a) + Tfit(b)    false: Tfit(a) + Tfit(b)
            a || b      a && b: min[Tfit(a) + Ffit(b), Ffit(a) + Tfit(b)]
                        a: Ffit(a) + Tfit(b)    true: Ffit(a) + Ffit(b)
                        b: Tfit(a) + Ffit(b)    false: min[Tfit(a), Tfit(b)]

In summary, the proposed fitness function guides the search towards reaching a mutant (approach
level + branch distance), causing a discrepancy at the mutation point (mutation distance),
propagating it to the outcome of the mutated statement (predicate mutation distance) and
impacting specific program nodes that are likely to expose mutants (here referred to as impact
nodes). Computing the overall fitness of the test cases requires a unification of the three used
measures. This is done based on the following equation, where branch and mutation distances are
normalized as in (Arcuri 2010):

$$
\begin{aligned}
\text{fitness} &= \text{reach dis} + \text{mutation dis} + \text{impact dis} \\
\text{reach dis} &= 2 \cdot \text{approach level} + \text{normalized}(\text{branch distance}) \\
\text{mutation dis} &= \text{normalized}(\text{mutation distance}) + \text{normalized}(\text{pmd}) \\
\text{impact dis} &= \text{approach level} + \text{normalized}(\text{branch distance})
\end{aligned}
\tag{3}
$$
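
The composition in expression (3) can be sketched directly in code. The following is a minimal
illustration with hypothetical names; in particular the normalization d / (d + 1) is an
assumption made here for concreteness, chosen because it maps any non-negative distance into
the interval [0, 1).

```java
/** Sketch of the overall fitness of expression (3); names and normalization are assumptions. */
final class FitnessSketch {
    /** Assumed normalization: maps a non-negative distance into [0, 1). */
    static double normalize(double d) {
        return d / (d + 1.0);
    }

    /** reach dis = 2 * approach level + normalized(branch distance) */
    static double reachDis(int approachLevel, double branchDistance) {
        return 2 * approachLevel + normalize(branchDistance);
    }

    /** mutation dis = normalized(mutation distance) + normalized(pmd) */
    static double mutationDis(double mutationDistance, double pmd) {
        return normalize(mutationDistance) + normalize(pmd);
    }

    /** impact dis = approach level + normalized(branch distance), taken on the mutant program */
    static double impactDis(int approachLevel, double branchDistance) {
        return approachLevel + normalize(branchDistance);
    }

    /** fitness = reach dis + mutation dis + impact dis; lower is better, 0 is the optimum. */
    static double fitness(int reachLevel, double reachBranch, double mutationDistance,
                          double pmd, int impactLevel, double impactBranch) {
        return reachDis(reachLevel, reachBranch)
                + mutationDis(mutationDistance, pmd)
                + impactDis(impactLevel, impactBranch);
    }

    public static void main(String[] args) {
        System.out.println(fitness(1, 3.0, 1.0, 0.0, 0, 1.0));
    }
}
```

Because each normalized term stays below 1, a test that misses one more control dependency
always scores worse on the reach component than one that is merely far from flipping the
current branch.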

Dynamic approach level 

Mutation testing introduces a vast number of mutants which are spread across the whole program
structure. It was observed that trying to kill them results in collaterally reaching many other
mutants (possibly hard to reach ones), i.e. without aiming at them. It is also noted that many
mutants are equivalent and thus, by definition, aiming at them wastes effort. In practice, these
two characteristics of mutation can provide useful information to assist the killing of other
mutants. This paper proposes the concept of the dynamic approach level, which serves as a means
of improving the search process.

Search based approaches utilizing approach level and branch distance fitness functions have the
drawback of degenerating to random testing (McMinn and Holcombe 2006) for certain types of
programs. This is due to the use of certain programming constructs such as flags, enumeration
types and various data dependencies (Awedikian et al. 2009). To avoid such a situation, the
dynamic approach level tries to include some data dependencies in the fitness evaluation. These
data dependencies are not included in the "static" predefined approach level. The need for such
an inclusion has also been pointed out by Awedikian et al. (2009), who also argued that doing so
is quite easy and leads to improved performance.

The rationale behind the standard approach level (Wegener et al. 2001) is to include only the
structural elements (control dependencies) that must be traversed by any of the sought test
cases. Consider a case where traversing a targeted branch requires the execution of a specific
program statement (data dependency) that is not part of the control dependencies of the targeted
branch. Then, all the possible test cases that traverse this branch also traverse that
statement. The dynamic approach level identifies all the common structural elements traversed by
the produced test cases and thus captures the necessary data dependencies too. This way the path
information gained during the whole search process can be used and reproduced for infecting and
eventually killing the targeted mutants.

The dynamic approach level is defined as the intersection of the nodes contained in all the
encountered execution paths that reach a targeted node. For example, if the target node is x and
5 different execution paths leading to node x have been encountered during the search process,
then the dynamic approach level is formed from the nodes common to these 5 paths. If no path
leading to node x has been encountered, the standard approach level is used. It is noted that in
the absence of data dependencies the dynamic approach level could approximate the standard
approach level if most of the paths have been executed. This approach relies on the extensive
search performed to kill all the introduced mutants.
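
The intersection-based definition above can be sketched as follows. The code is a hypothetical
illustration under stated assumptions: nodes are integers, an execution path is the list of
nodes in execution order, only the nodes executed before the target is first reached are
intersected, and the level is taken to be the number of common nodes that a candidate test
misses, by analogy with the standard approach level.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch of a dynamic approach level built from the paths observed during the search. */
final class DynamicApproachLevelSketch {
    /** For each reached node: the nodes executed before it on every recorded path. */
    private final Map<Integer, Set<Integer>> required = new HashMap<>();

    /** Records one executed path, updating the intersection of every node on it. */
    void recordPath(List<Integer> path) {
        Set<Integer> seen = new HashSet<>();
        for (Integer node : path) {
            Set<Integer> common = required.get(node);
            if (common == null) {
                required.put(node, new HashSet<>(seen)); // first path reaching this node
            } else {
                common.retainAll(seen);                  // keep only nodes shared by all paths
            }
            seen.add(node);
        }
    }

    /** Nodes common to all recorded paths reaching the target (empty if never reached). */
    Set<Integer> requiredNodes(int target) {
        return required.getOrDefault(target, new HashSet<>());
    }

    /**
     * Number of required nodes missed by a test that executed the given nodes.
     * Falls back to the supplied standard approach level if the target was never reached.
     */
    int level(int target, Set<Integer> executed, int standardLevel) {
        if (!required.containsKey(target)) {
            return standardLevel;
        }
        int missed = 0;
        for (Integer node : required.get(target)) {
            if (!executed.contains(node)) {
                missed++;
            }
        }
        return missed;
    }
}
```

A statement that every observed path executes before the target, such as an assignment that
sets a flag, ends up among the required nodes even though it is not a control dependency of the
target.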



The mutation AVM method

The proposed approach uses the alternating variable method, proposed by Korel (1990). This
method is a hill climbing algorithm which has been shown to be quite effective compared to other
search algorithms in the context of structural testing (Harman and McMinn 2007) and has also
been incorporated into automated test data generation tools such as the AUSTIN tool (Lakhotia
et al. 2010) for structural testing. Hence, it forms an ideal choice as it is quite simple to
implement and is also expected to be quite powerful. Here, it should be noted that mutation
testing, in particular weak mutation, can be transformed to branch testing (Papadakis and
Malevris 2009), (Papadakis et al. 2010), (Papadakis and Malevris 2011), (Papadakis and Malevris
2012) and since hill climbing performs similarly to its rivals in the structural testing context
(Harman and McMinn 2007), there is no reason why this should not hold for mutants too.
Nevertheless, this is beyond the scope of the present paper and is left open for future
research.

The method starts by randomly initializing the values of the program input variables. Then it
repeatedly selects one of those values and adjusts it. This is performed until no further
fitness improvement can be obtained, i.e. no further alterations are fruitful. In this case the
method switches to the next input variable. The algorithm stops when no further fitness
improvement can be recorded by selecting and altering any of the input variables. Consider the
example of Figure 1. To make this example more understandable, let us assume that when a mutant
is weakly killed, it is also strongly killed. The same approach holds in the opposite case, with
the only difference being in the fitness calculations. The left part of Figure 1 presents the
original sample program and the right part details the mutated meta-program. The introduction of
the mutants is recorded in the alterations made to the original program, e.g. the statement
if (i < k) has become if (RelationalLT(i, k, 15)). The variables i and k are the two original
operand variables while 15 signifies that this expression contains the mutants of the relational
operator (7 mutants) with identification numbers from 15 to 21. Let the initial random inputs be
i = 150, j = 400, k = 300 and the target mutant be the 15th one, i.e. (i < k to i <= k with
mutant fitness abs(i-k)). The process at first selects the i input variable and performs
exploratory steps (small increases and decreases of the input variable by a step p, here 1 for
integer and 0.1 for real variables). These steps indicate the search direction. In the example
here, i should be increased as this results in better fitness values. After the determination of
the search direction the process continues with pattern steps (these steps are computed based on
the formula 2^n * direction * p, where direction is 1 for increase or -1 for decrease). Thus, in
the above example the next input values obtained for the i variable (pattern steps) will be 152,
154, 158, 166, 182, 214, 278, 406. At this point the fitness cannot be further improved by
altering the i input variable, as it also relies on the second branch point (j < k). The process
continues with input variable j: it performs exploratory steps and starts to decrease the j
value as follows: 398, 396, 392, 384, 368, 336, 272. At that point the process chooses the k
input variable and starts increasing its value to 302, 304, 308, 316, 332, 364, 428. After value
428 it performs exploratory steps again and starts to decrease its value to 426, 424, 420, 412,
396. Here it changes direction again and continues to 398, 400, 404, 412, where it decreases to
410, 408, 404 and finally finds the required value 406 that kills the mutant. The process has
thus produced the test case (i = 406, j = 272, k = 406) that kills the mutant (i < k to
i <= k). If this procedure fails to kill the required mutant, the process restarts using new
randomly selected inputs for i, j and k. Such a failure could be a consequence of hitting a
local minimum or of targeting an equivalent mutant.

Original program                    Mutated program

Mutatest (int i, int j, int k){     Mutatest (int i, int j, int k){
  int ret = 0;                        int ret = 0;
  if (i > j)                          if (RelationalGT(i, j, 1))
    if (j < k)                          if (RelationalLT(j, k, 8))
      if (i < k)                          if (RelationalLT(i, k, 15))
}                                   }

Figure 1 Demonstrating Example.
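
The exploratory and pattern steps of the example can be sketched as a small program. This is a
textbook-style variant written under stated assumptions, not the framework's implementation:
all names are hypothetical, the fitness is a simplified reach plus mutation distance for mutant
15 of Figure 1 with a d / (d + 1) normalization, and a move is accepted only when it strictly
improves the fitness, so the intermediate values visited may differ slightly from those listed
in the walk-through.

```java
import java.util.Arrays;
import java.util.function.ToDoubleFunction;

/** Sketch of the alternating variable method on the Figure 1 example (hypothetical names). */
final class AvmSketch {
    static double normalize(double d) {
        return d / (d + 1.0);
    }

    /** Simplified fitness for killing mutant 15 (i < k replaced by i <= k); 0 means killed. */
    static double fitness(long[] x) {
        long i = x[0], j = x[1], k = x[2];
        if (!(i > j)) return 2 + normalize(j - i + 1); // first branch not taken
        if (!(j < k)) return 1 + normalize(j - k + 1); // second branch not taken
        return normalize(Math.abs(i - k));             // mutation distance abs(i-k)
    }

    /** Minimizes f, starting from the given inputs, within a budget of fitness evaluations. */
    static long[] search(long[] start, ToDoubleFunction<long[]> f, int budget) {
        long[] x = start.clone();
        double best = f.applyAsDouble(x);
        int evals = 1;
        boolean improvedInSweep = true;
        while (best > 0 && improvedInSweep && evals < budget) {
            improvedInSweep = false;
            for (int v = 0; v < x.length && best > 0; v++) {
                boolean improved = true;
                while (improved && best > 0 && evals < budget) {
                    improved = false;
                    for (long dir : new long[] {1, -1}) {
                        long[] probe = x.clone();
                        probe[v] += dir;                       // exploratory step of size p = 1
                        double fp = f.applyAsDouble(probe);
                        evals++;
                        if (fp < best) {
                            x = probe;
                            best = fp;
                            improved = true;
                            improvedInSweep = true;
                            long step = 1;                     // pattern steps: 1, 2, 4, 8, ...
                            while (best > 0 && evals < budget) {
                                long[] next = x.clone();
                                next[v] += dir * step;
                                double fn = f.applyAsDouble(next);
                                evals++;
                                if (fn >= best) break;         // overshoot: stop accelerating
                                x = next;
                                best = fn;
                                step *= 2;
                            }
                            break;                             // re-explore around the new point
                        }
                    }
                }
            }
        }
        return x;
    }

    public static void main(String[] args) {
        long[] test = search(new long[] {150, 400, 300}, AvmSketch::fitness, 10_000);
        System.out.println(Arrays.toString(test) + " fitness=" + fitness(test));
    }
}
```

Starting from i = 150, j = 400, k = 300, this sketch ends at the same test case as the example,
(i = 406, j = 272, k = 406). A full implementation would also restart from new random inputs
when the search stalls, as described above.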

Evaluation 

This section empirically evaluates the effectiveness of the propositions made in this paper
based on two experiments. The first experiment compares the effectiveness of the proposed
framework in performing mutation testing using three fitness functions and random testing. The
second experiment examines the impact of utilizing the dynamic approach level on the framework's
effectiveness.

Experimental design 

The experiment described in this section uses the proposed mutation testing framework on a set
of Java programs using the mutation operators presented in Table 2 along with the incorporated
necessity fitness formulas. The framework works on Java programs (produces mutants) with a
primary target at the intra-method level (Ma et al. 2005), similar to non Object-Oriented
languages (see section V.A for details about the framework capabilities and limitations). The
experiment empirically investigates the following Research Questions (RQs):

• RQ 1: How effective are the adopted fitness 
functions compared to a previously proposed one 
(Ayari et al. 2007) and random testing? 

• RQ 2: What is the relative efficiency of the adopted 
fitness functions? 



• RQ 3: What is the impact of the dynamic approach 
level on the effectiveness of the examined 
approaches? 

To answer the above questions, the proposed framework
was employed to generate test cases for a set of
programs based on the mutation operators presented in
Table 2. The results reported here are the average
values obtained from applying each examined approach
10 times independently. To answer RQ1, the number of
killed mutants was measured. For RQ2, the number of
fitness evaluations required to produce the sought test
data was measured. For RQ3, the experiment was repeated
utilizing the dynamic approach level. Specifically,
random testing and three fitness functions were utilized
in the conducted experiments. The first fitness
function, named "Reach", uses only the reach distance of
expression 3 and corresponds to the fitness function
suggested by Ayari et al. (2007). The second one, called
"Infect", uses the Reach and Infect distances of
expression 3, and the third one, named "Impact", utilizes
expression 3 in full. In the second experiment, the
process followed was to iteratively perform one attempt
to kill all the live mutants using the standard approach
level and one using the dynamic approach level. For each
test subject, the test generation process considered up
to 50,000 fitness evaluations, or random test
generations in the case of random testing, for all the
introduced mutants. This is a considerably high number
of evaluations, but it is necessitated by the existence
of equivalent mutants.
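
The relation between the three fitness variants can be illustrated with a short sketch. Expression 3 is not restated here, so the layering below is an assumption made purely for illustration: each variant is modelled as a sequence of stages (reach, infect, impact), each stage contributes a non-negative distance that is normalised into [0, 1), and every stage still pending after the first unsatisfied one adds a full unit. The names `STAGES`, `normalise` and `fitness`, and the treatment of the impact term as a distance to be minimised, are hypothetical.

```python
def normalise(distance):
    """Map a non-negative distance into [0, 1)."""
    return distance / (distance + 1.0)


# Stages each variant tries to satisfy, in order (assumed layering).
STAGES = {
    "Reach": ("reach",),
    "Infect": ("reach", "infect"),
    "Impact": ("reach", "infect", "impact"),
}


def fitness(variant, distances):
    """Return a value to minimise; 0 means every stage of the variant holds.

    `distances` maps a stage name to its non-negative distance for one
    test execution. Stages that follow the first unsatisfied stage each
    contribute a full unit, so earlier stages dominate later ones.
    """
    stages = STAGES[variant]
    for position, stage in enumerate(stages):
        d = distances[stage]
        if d > 0:
            remaining = len(stages) - position - 1
            return remaining + normalise(d)
    return 0.0


# A test that reaches the mutant but does not yet infect the program state.
reached_only = {"reach": 0, "infect": 5, "impact": 3}
for name in STAGES:
    print(name, fitness(name, reached_only))
```

Under these assumptions, a test that merely reaches the mutated statement already scores 0 with Reach and therefore provides no further guidance, whereas Infect and Impact still return a positive value that the search can reduce.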

Results and analysis 

The experimental evaluation of the proposed system was
performed on 8 program units. The selected programs have
been used in various studies (DeMillo and Offutt 1991),
(Papadakis et al. 2010), (Sthamer 1996), (Murrill 2008).
The test objects are programs with various
characteristics such as mathematical computations, array
manipulations, state based behavior and complex
branching conditions. A considerable number of mutants,
2,759 in total, were produced using all the operators
employed by the framework. Particulars of all the used
programs are given in Table 3, which records the test
objects' lines of code, input settings (input domain
search space) and the number of produced mutants.

The experiment tries to reveal the ability of the
framework to perform mutation testing and its
effectiveness compared to a previously proposed approach
without any particular assistance. That is, none of the
equivalent mutants was eliminated from the candidate
mutant set, a fact that adds considerable overhead to the
conducted experiment. Furthermore, no data dependencies,
state related information or flag removal approaches
were employed in order to make the test search more
efficient. The effective incorporation of such
approaches is considered out of the scope of the present
paper and has thus been left for future work.

The results of the performed experiments are presented
in Table 4, which records, per test subject, the number
of mutants killed by the test data produced by random
testing (Random) and by the three employed fitness
functions utilizing the standard (Reach, Infect and
Impact) and the dynamic approach levels (DReach, DInfect
and DImpact). Additionally, Figure 2 reports the sum of
the killed mutants for various fitness evaluation limits
when using either the static or the dynamic approach
level.

The obtained results provide evidence in support of the
proposed fitness functions (Infect and Impact), which
outperform a previously proposed one (Reach) and random
testing (RQ1). Additionally, the use of the dynamic
approach level improves the effectiveness of all the
examined fitness functions (RQ3). Specifically, the
Infect and Impact fitness functions kill on average 113
and 122 more mutants than the Reach one respectively,
using the standard approach level. The use of the
dynamic approach level results in an increase with all
three examined functions, killing 71, 40 and 87 more
mutants for the Reach, Infect and Impact fitness
functions respectively. Additionally, all the examined
fitness functions converge better as the number of
evaluations grows. This is because, with a higher number
of executions, more paths are included in the dynamic
approach level. Recall that the dynamic approach level
is adopted according to all the encountered execution
paths.

Considering the efficiency of the approach (RQ2), the
number of mutant evaluations should be examined. From
Figure 2 it can be observed that both the Infect and
Impact fitness functions are more efficient than the
Reach one even for a small number of evaluations
(approximately over 4,500 evaluations). For fewer than
4,500 evaluations all the approaches share a similar
effectiveness and efficiency. The use of the dynamic
approach level generally improves the efficiency of the
utilized fitness functions, since more mutants are killed
for the same number of evaluations.

Table 3 Test program details

| Test subjects | Lines of code | Input settings | No. of mutants |
|---|---|---|---|
| 1-Triangle | 35 | 3 ints: (range 16-bit) | 166 |
| 2-Trityp | 40 | 3 ints: (range 16-bit) | 349 |
| 3-Triangle | 90 | 3 ints: (range 16-bit) | 421 |
| 4-Remainder | 50 | 2 ints: (range 16-bit) | 324 |
| 5-Callendar | 75 | 5 ints: 2x[0, 12], 2x[0, 365], [-3,000, 3,000] | 327 |
| 6-Cancel | 50 | 3 ints: (3x[0, 50]) | 866 |
| 7-FourBalls | 30 | 4 floats, 1 ints: 4x[-100, 100], [-100, 100] | 225 |
| 8-Quadratic | 25 | 3 ints (range 16-bit) | 81 |

Table 4 Mutants killed by the utilized fitness functions

| Test subjects | Random | Reach | Infect | Impact | DReach | DInfect | DImpact |
|---|---|---|---|---|---|---|---|
| 1-Triangle | 102.2 | 94 | 103 | 103.4 | 96.4 | 103 | 103.2 |
| 2-Trityp | 125.6 | 173.8 | 178.4 | 184.8 | 205.4 | 210.4 | 223 |
| 3-Triangle | 102 | 131 | 144.4 | 146.2 | 143.8 | 148.6 | 185 |
| 4-Remainder | 205.8 | 201.4 | 206 | 206 | 201.4 | 206 | 206 |
| 5-Callendar | 189 | 165 | 195.2 | 193.2 | 168.6 | 198.8 | 200 |
| 6-Cancel | 712.6 | 686.2 | 732.2 | 732.6 | 709.26 | 732 | 733.2 |
| 7-FourBalls | 187.2 | 183.2 | 185 | 186.8 | 181 | 185.8 | 188 |
| 8-Quadratic | 59.07 | 58 | 61.22 | 61.8 | 58 | 60.6 | 63 |

Random testing generally performs worse than the Infect
and Impact fitness functions, irrespective of whether
the dynamic approach level is used. However, it performs
similarly to Reach without the dynamic approach level
and worse than Reach when the latter utilizes it. Here
it must be noted that the comparison is made in favor of
random testing: the framework determines the
collaterally killed mutants when, and only when, it has
killed a targeted mutant, whereas in random testing all
tests are executed against all mutants. Nevertheless,
even in such a case the proposed approach outperforms
random testing.

By employing the proposed framework with 150,000 fitness
evaluations on the Trityp program (DeMillo and Offutt
1991), which is a well established benchmark in both
search based and mutation testing studies, the results
presented in Figure 3 were obtained. From this figure it
becomes evident that the Impact fitness utilizing the
dynamic approach level can lead to a considerably high
number of killed mutants. In this case it manages to
kill all but two of the killable mutants. This fact
suggests that the proposed approach can be quite
powerful in producing mutation based test cases.

Related work

The alternating variable method, which was adopted by
the present framework for finding the appropriate tests,
was initially proposed by Korel (1990). Daimler (Wegener
et al. 2001) developed an automated tool for testing C
programs based on various structural testing criteria.
It is this tool's fitness function that is extended by
the present research.

Test case generation techniques for mutation testing
have received little attention in the literature. The
most fundamental attempt is the one due to DeMillo and
Offutt, who developed the constraint based testing
technique (DeMillo and Offutt 1991). This technique
introduced the concepts of reachability, necessity and
sufficiency conditions, which have been embodied in a
tool called Godzilla. Godzilla handles the first two of
these conditions, formulating and resolving them as
mathematical systems of constraints. Formulating and
resolving reachability and necessity constraints forms a
difficult task. In order to efficiently handle this
task, in (Papadakis and Malevris 2012) and (Papadakis
and Malevris 2009) it is suggested to use a path
selection strategy that reduces the effects of
infeasible paths. Bottaci (2001) proposed a fitness
function composed of the reachability distance of the
produced tests (which measures the closeness of the test
data to the mutant statement) and the necessity distance
(which measures the closeness to killing the mutant
statement). In (Ayari et al. 2007) a search based
approach for the generation of mutation test data was
proposed by implementing only the reachability part of
the Bottaci (2001) fitness function. More recently,
Fraser and Zeller (2010) proposed another evolutionary
based approach to automate the production of mutation
tests. This approach uses the reachability part of
Bottaci's fitness function (Bottaci 2001) and
approximates the necessity and sufficiency conditions by
measuring the mutant's impact (Fraser and Zeller 2010).
They argue that producing tests with higher mutant
impact results in tests closer to killing those mutants.
The above two approaches are the closest ones to the
present proposed framework. The main difference is that
the proposed framework extends the fitness function to
effectively direct the search towards the necessity and
sufficiency conditions (DeMillo and Offutt 1991).
Additionally, a novel technique to efficiently gain and
dynamically adopt the required fitness information
through mutant schemata is presented by the present
paper.

The idea of utilizing mutant schemata in order to help
automated tools to perform mutation was initially
introduced in (Papadakis et al. 2010) with the aim of
reusing existing structural automated tools for
performing mutation. The underlying technique was to
reduce the problem of weakly killing mutants to that of
covering branches. In (Papadakis and Malevris 2010a) the
schemata approach was extended to utilize dynamic
symbolic execution for producing strong mutation based
tests. In the present paper, mutant schemata were used
in order to enable the fitness guidance towards killing
mutants under strong mutation.
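
The mutant schemata idea can be illustrated with a small sketch. It is not the framework's code: the mutant table, the `Schema` class, the `rel` metafunction and the `smaller` program are hypothetical names chosen for the example. All mutants of a relational expression are embedded in one program version, the active mutant is selected at run time, and the schema records whether the mutated expression evaluated differently from the original one (weak kill), while a differing program output corresponds to a strong kill.

```python
import operator

# Hypothetical mutant table: id -> (mutation site, replacement operator).
MUTANTS = {
    1: ("s1", operator.le),  # i < k  ->  i <= k
    2: ("s1", operator.ne),  # i < k  ->  i != k
}


class Schema:
    """Holds the active mutant and records whether it infected the state."""

    def __init__(self, active=0):  # 0 selects the original program
        self.active = active
        self.infected = False

    def rel(self, site, original_op, a, b):
        """Metafunction standing in for a relational expression at `site`."""
        expected = original_op(a, b)
        mutant = MUTANTS.get(self.active)
        if mutant is None or mutant[0] != site:
            return expected
        actual = mutant[1](a, b)
        self.infected = self.infected or (actual != expected)
        return actual


def smaller(i, k, schema):
    """Program under test, rewritten once so that every mutant is embedded."""
    return "i" if schema.rel("s1", operator.lt, i, k) else "k"


original = Schema(0)
mutant = Schema(1)
strongly_killed = smaller(406, 406, original) != smaller(406, 406, mutant)
weakly_killed = mutant.infected
```

Since a single instrumented version serves every mutant, the same execution that yields the program output can also supply the information needed by the fitness calculation.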

Discussion 

The proposed framework originates from the integrated
use of mutant schemata and evolutionary testing
techniques utilizing a novel fitness function. This
integration helps to efficiently extract dynamic program
information concerning the introduced mutants and the
fitness calculations. The conducted case study indicates
the ability of the proposed method to produce high
quality test cases from scratch (starting from random
inputs). Additionally, it indicates the improved
performance over random testing and a previously
proposed approach. In fact, the proposed fitness
functions are compared with the one proposed in (Ayari
et al. 2007). An approach similar to that of (Ayari et
al. 2007) has also been proposed by Fraser and Zeller
(2010), who extend it by including the mutants' impact
in their fitness. Such an inclusion was attempted in the
conducted case study. The obtained results were similar
to the ones obtained by the Reach fitness function. This
is due to the use of hill climbing and the absence of
effective guidance towards some specific program
statements.

Figure 3 Trityp program: Mutants killed vs fitness evaluations.
(Two panels: using the "static" approach level and using the
"dynamic" approach level; horizontal axis: fitness evaluations,
from 0 to 150,000.)

Equivalent mutants help the proposed approach to build
the dynamic approach level, as they force the search
towards various different program statements and
conditions. Despite this, the conducted case study makes
evident that equivalent mutants pose an additional
burden to test case evolution: they not only force the
method to search for non killable mutants (a huge
effort) but also distort the calculated mutation score.
This fact explains why the proposed approach spends so
many fitness evaluations in order to kill additional
mutants. Perhaps heuristic approaches for isolating
equivalent mutants, such as the one suggested in (Kintis
et al. 2012), could be employed in order to overcome
this problem. This paper also reveals that simple
dynamic approaches can be quite effective for the
production of high quality test cases. Owing to the
dynamic nature of the adopted approach, the problems
caused by pointers and non linear expressions are
limited.

The framework described here uses a quite simple but
practical approach, based on the mutant schemata
technique (Untch et al. 1993), in order to perform the
test data generation process. The approach is
incremental: it first targets reaching, then weakly
killing and finally strongly killing the introduced
mutants. This incremental process helps in producing
some tests capable of weakly killing some strongly
equivalent mutants. These tests should be valuable and
should increase the quality of the performed testing.



Tool characteristics and limitations 

The framework proposed in this paper has several special
characteristics and limitations, which are currently
under research. Generally, it can handle Java programs
only at the intra method level. Thus, it does not handle
method sequences or Object Oriented features. Its
application treats one method at a time using predefined
method sequences. The inability of the mutant schemata
technique to handle certain Object Oriented mutants, as
identified in (Ma et al. 2005), limits the propositions
made in this paper to the intra method level. Here it
must be noted that logical operators required special
handling. This is due to the short circuit evaluation
mechanism of the Java language. In order to keep the
program execution paths unaffected by the presence of
mutants, the evaluations of the logical operators were
performed when both logical operands were executed.
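
One possible reading of this special handling is sketched below; it is an interpretation made for illustration, not a description of the framework's code, and the function name `and_site` and its return convention are hypothetical. The operands are passed as zero-argument callables so that they are executed exactly as the original short circuit rule dictates, and the mutated operator is applied only when both operands have actually been executed.

```python
import operator


def and_site(left, right, mutant_op=None):
    """Schema for `left() and right()`; returns (value, mutant_evaluated).

    `left` and `right` are zero-argument callables, so operand execution
    stays under the control of the original short circuit rule.
    """
    a = left()
    if not a:
        # The original never executes the right operand here, so the
        # mutant is not evaluated and the original value is kept.
        return False, False
    b = right()
    if mutant_op is None:
        return bool(b), False
    return bool(mutant_op(a, b)), True


calls = []


def right_operand():
    calls.append("right")  # makes operand execution observable
    return False


skipped = and_site(lambda: False, right_operand, operator.or_)
mutated = and_site(lambda: True, right_operand, operator.or_)
```

In this sketch the right operand is executed only in the second call, exactly as in the unmutated program, so instrumentation placed inside the operands observes the same path with and without the mutant.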

Threats to validity 

The present paper focuses on presenting an automated
mutation testing system. One possible threat to the
validity of the reported results relates to their
generalization; the framework's effectiveness may vary
in other cases. However, as the proposed method utilizes
and extends the suggestions made by DeMillo and Offutt
(1991), Bottaci (2001) and Fraser and Zeller (2010),
their use is expected to increase the effectiveness of
search based approaches. Additionally, the results
obtained may serve as a yardstick towards the employment
of mutation testing in an automated fashion, indicating
that it is possible to adopt mutation for the testing
activity, resulting in the production of high quality
tests.

The proposed framework uses weak/strong mutation and
mutants' impact for guiding and evaluating its test
production effectiveness utilizing the AVM. This does
not necessarily mean that the results here can be
extrapolated to the rest of the search based approaches.
In any case, this was not the intention of the present
framework, as it aims at using AVM, which is quite
simple and practical.

The last threat, one of internal validity, stems from
the use of software systems. Thus, possible bugs, the
fitness measures and the implementation of the schematic
program production of the test objectives may have
influenced the obtained results. To reduce these
threats, selected test cases were executed in both the
original and the converted program versions, showing
that they execute the same program paths. An additional
manual evaluation of the results produced by the
framework on the Trityp program was performed, showing
no discrepancies.

Conclusion and future work 

The proposed framework, as described here, forms an
automation of the mutation testing method. The framework
uses state of the art techniques to efficiently generate
the candidate mutants and produce mutation based test
data. Based on the performed case study, the system
manages to produce test cases able to kill the majority
of the introduced mutants. This also establishes tests
for performing high quality testing, this being the main
issue of the present paper. Preliminary results suggest
that the proposed fitness functions can outperform a
previously proposed one as well as random testing. Also,
the use of the dynamic approach level can increase the
effectiveness of the framework. In particular, based on
the conducted case study, the suggested propositions
manage to kill on average approximately 7.6% more
mutants than a previously proposed approach (Ayari et
al. 2007) and 7.9% more than random testing.
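
The two percentages can be traced back to Table 4 with a short calculation. The sketch assumes that they compare the DImpact column with the Reach and Random columns and that they are expressed relative to all 2,759 introduced mutants; both points are an interpretation, and the per-subject averages listed in the code are the Table 4 entries as read here.

```python
TOTAL_MUTANTS = 2759

# Average killed mutants per test subject (rows 1-8 of Table 4).
RANDOM = [102.2, 125.6, 102, 205.8, 189, 712.6, 187.2, 59.07]
REACH = [94, 173.8, 131, 201.4, 165, 686.2, 183.2, 58]
DIMPACT = [103.2, 223, 185, 206, 200, 733.2, 188, 63]


def gain_in_points(better, baseline, total=TOTAL_MUTANTS):
    """Extra mutants killed, as a percentage of all introduced mutants."""
    return 100.0 * (sum(better) - sum(baseline)) / total


gain_vs_reach = gain_in_points(DIMPACT, REACH)
gain_vs_random = gain_in_points(DIMPACT, RANDOM)
print(f"DImpact vs Reach:  {gain_vs_reach:.1f} points")
print(f"DImpact vs Random: {gain_vs_random:.1f} points")
```

Under this reading the figures are differences in mutation score over the full mutant set, including equivalent mutants, rather than improvements relative to the number of mutants killed by the baseline.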

In the future, extensions of the framework to include
other search based approaches, such as evolutionary
testing, are planned. Further investigation is needed in
order to determine the benefits of the dynamically
adopted approach level and its optimal use in search
based testing. Finally, the application of the proposed
approach to killing second or higher order mutants
(Papadakis and Malevris 2010b), (Kintis et al. 2010),
(Jia and Harman 2010) is under investigation. Since such
approaches have been shown to be quite effective in
isolating equivalent mutants (Kintis et al. 2012), their
consideration within the proposed framework is expected
to greatly enhance the level of automation achieved when
performing mutation testing.